The data comes from Kaggle, the largest online platform hosting data sets and competitions for machine learning practitioners. More precisely, this data set was part of the final round of SLICED, a data science competition streamed on Twitch by Nick Wan and Meg Risdal.
The data covers loans backed by the U.S. Small Business Administration (SBA). Using various pieces of information about each loan, the goal is to predict the default amount.
The U.S. Small Business Administration (SBA) is a United States government agency that provides support to entrepreneurs and small businesses. The mission of the Small Business Administration is “to maintain and strengthen the nation’s economy by enabling the establishment and viability of small businesses and by assisting in the economic recovery of communities after disasters”. SBA loans are made through banks, credit unions and other lenders who partner with the SBA. The SBA provides a government-backed guarantee on part of the loan. Under the Recovery Act and the Small Business Jobs Act, SBA loans were enhanced to provide up to a 90 percent guarantee in order to strengthen access to capital for small businesses after credit froze in 2008. (Source)
Even though the loans provided in partnership with the SBA are government backed, it is still of relevance to assess the creditworthiness of these small businesses. Let’s see what we can do with Random Forest.
First, I load the data. The first file is the one used for training, whereas the holdout will only be used for submitting out-of-sample predictions.
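Loading amounts to two read_csv calls. Since the actual file names are not shown here, the following is a self-contained sketch using a tiny stand-in file rather than the real Kaggle files:

```r
library(readr)

# Self-contained sketch: write a tiny stand-in file first, then load it the
# same way the real training and holdout files would be loaded.
path <- tempfile(fileext = ".csv")
writeLines(c("LoanNr_ChkDgt,GrAppv,default_amount",
             "1,60000,0",
             "2,40000,10000"), path)

data_demo <- read_csv(path)
dim(data_demo)  # 2 rows, 3 columns
```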
The training data is fairly large, containing information on 83,656 loans with 19 predictors and the response variable of the USD default amount.
names(data)
## [1] "LoanNr_ChkDgt" "Name" "Sector"
## [4] "City" "State" "Zip"
## [7] "Bank" "BankState" "NAICS"
## [10] "ApprovalFY" "NoEmp" "NewExist"
## [13] "CreateJob" "RetainedJob" "FranchiseCode"
## [16] "UrbanRural" "DisbursementGross" "GrAppv"
## [19] "SBA_Appv" "default_amount"
It is very nice to see that there are virtually no missing values in the predictors; the only exception is a tiny fraction of missingness in the dummy variable NewExist, which will need imputation.
colMeans(is.na(data)) %>%
tidy() %>%
rename(pct = x) %>%
mutate(names = fct_reorder(names, pct)) %>%
# filter(pct > 0) %>%
ggplot(aes(pct, names)) +
geom_col(fill = "midnightblue") +
labs(title = "Missing Data In Variables",
subtitle = "Percent missingness calculated for each column",
y = NULL,
x = NULL) +
scale_x_continuous(labels = scales::percent_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
The variables are:
LoanNr_ChkDgt: This identifying column should be random and thus not bear any predictive power.
The Name variable contains the names of the businesses that have been granted loans from the SBA. Potentially, these names could contain useful information which can be extracted using word tokenisation at a later stage.
The 15 most frequent ones are:
data %>%
count(Name, sort = T) %>%
head(15)
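Word tokenisation of business names can be sketched with the textrecipes package. The following is a toy illustration on invented names, not the competition data, mirroring the tokenisation steps used in the recipe later on:

```r
library(tibble)
library(recipes)
library(textrecipes)

# Toy stand-in names -- purely hypothetical businesses
toy <- tibble(Name = c("SMITH TRUCKING LLC", "JONES TRUCKING INC"),
              default_amount = c(0, 10000))

baked <- recipe(default_amount ~ Name, data = toy) %>%
  step_tokenize(Name) %>%                     # split names into word tokens
  step_tokenfilter(Name, max_tokens = 3) %>%  # keep only the most frequent tokens
  step_tf(Name) %>%                           # one term-frequency column per token
  prep() %>%
  bake(new_data = NULL)

names(baked)  # includes "tf_Name_trucking" among the new columns
```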
The Sector variable gives the sector of the business to which the loan was made. There are 20 sectors.
data %>%
count(Sector, sort = T)
Creating a default percentage from default amount divided by gross amount allows me to look at the distribution of defaults for each sector in order to see whether it is an interesting predictor.
data %>%
group_by(Sector) %>%
summarise(GrAppv = sum(GrAppv),
default_amount = sum(default_amount),
default_pct = default_amount/GrAppv,
n = n()) %>%
mutate(Sector = fct_reorder(paste0(Sector, " (n=", n, ")"), default_pct)) %>%
ggplot(aes(default_pct, Sector)) +
geom_point(aes(size = GrAppv)) +
labs(title = "Default Percentage By Industry",
subtitle = NULL,
x = "Default Percentage",
y = NULL,
size = "Total Volume:") +
scale_x_continuous(labels = scales::percent_format()) +
scale_size_continuous(labels = scales::dollar_format(),
range = c(1,5)) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")
Summary statistics of that default percentage for each industry (not printed) reveal two interesting findings:
The second finding surely has to do with outstanding interest payments, which accumulate on top of the principal. Therefore, assuming a cap at 100% for the default rate does not make sense.
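As a hypothetical numeric illustration of why the default percentage can exceed 100% (all figures assumed):

```r
# Accrued interest is added on top of the principal, so the amount in
# default can exceed the originally approved amount.
gr_appv <- 100000          # gross approved loan amount (USD, assumed)
accrued_interest <- 15000  # unpaid interest at the time of default (assumed)
default_amount <- gr_appv + accrued_interest

default_pct <- default_amount / gr_appv
default_pct  # 1.15, i.e. a default rate of 115%
```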
There are a lot of cities, so it would make sense to lump less frequent levels together with step_other.
data %>%
count(City, sort = T)
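step_other from the recipes package pools infrequent levels into an "other" category. A toy sketch of its behaviour (invented cities, arbitrary 20% threshold):

```r
library(tibble)
library(recipes)

# Toy data: one frequent city and two rare ones
toy <- tibble(City = c(rep("Chicago", 8), "Springfield", "Peoria"),
              default_amount = rnorm(10))

lumped <- recipe(default_amount ~ City, data = toy) %>%
  step_other(City, threshold = 0.2) %>%  # levels under 20% frequency are pooled
  prep() %>%
  bake(new_data = NULL)

levels(lumped$City)  # "Chicago" "other"
```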
The same does not hold for states, as there are far fewer of them and even the least frequent ones have enough data.
data %>%
count(State, sort = T)
However, Zip codes are too abundant and will likely have to be lumped as well.
There are 314 banks in the data providing loans in cooperation with the SBA.
Bank state is similar to borrower state.
The North American Industry Classification System (NAICS) classifies business establishments by type of economic activity (process of production). With almost 1,200 levels, lumping also makes sense here.
data %>%
count(NAICS, sort = T)
The fiscal year definitely has to be an important piece of information regarding default probability, as the financial crisis in ’08 was notorious for loan defaults in the housing sector, tearing down other industries with it. A quick look at the percentage of default amount over gross approved loan amount per year confirms this suspicion:
data %>%
group_by(ApprovalFY) %>%
summarise(approved = sum(GrAppv),
defaulted = sum(default_amount),
default_pct = defaulted/approved) %>%
ggplot(aes(ApprovalFY, default_pct)) +
geom_line(colour = "midnightblue") +
geom_point(colour = "midnightblue") +
labs(title = "Default Percentage For Small Business Loans By Year (U.S. SBA)",
subtitle = "Default percentage calculated as default amount over gross loan amount.",
y = "Default Percentage",
x = "Fiscal Year") +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
Additionally, it is also interesting to see that the total loan amount in cooperation with the SBA increased more than six-fold over the period from 1990 to 2004.
data %>%
count(ApprovalFY, sort = T) %>%
ggplot(aes(ApprovalFY, n)) +
geom_col(colour = "white", fill = "midnightblue") +
labs(title = "Observation Count By Loan Approval Year",
subtitle = "Most approved loans precede the financial crisis in '08.",
y = "Approved Loans",
x = "Fiscal Year") +
scale_y_continuous(labels = scales::comma_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
data %>%
group_by(ApprovalFY) %>%
summarise(loans = sum(GrAppv)) %>%
ggplot(aes(ApprovalFY, loans)) +
geom_line(colour = "midnightblue") +
geom_point(colour = "midnightblue") +
labs(title = "Total Loan Amount Approved By Year",
subtitle = "Most approved loan volume precedes the financial crisis in '08.",
y = "Total Approved Loans (USD)",
x = "Fiscal Year") +
scale_y_continuous(labels = scales::comma_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
This was not attributable to larger loan sizes, however, as the individual loan size distribution actually shifted lower leading up to the peak in 2004. The quantity of loans was therefore the overriding factor.
data %>%
ggplot(aes(ApprovalFY %>% as.factor(), GrAppv)) +
geom_boxplot() +
labs(title = "Individual Loan Amount Distribution By Year",
subtitle = "Median approved loans got smaller in size before the crisis.",
y = "Approved Amount Per Loan (USD)",
x = "Fiscal Year") +
scale_y_log10(labels = scales::comma_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
As the mission of the SBA states, most loans go to small businesses with few employees (between 1 and 10). Virtually no loans were given to firms with more than 100 employees.
data %>%
ggplot(aes(NoEmp)) +
geom_histogram(fill = "midnightblue", colour = "white") +
labs(title = "Employees Per Business",
subtitle = NULL,
y = "Frequency",
x = "Employees") +
scale_x_log10(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
Surprisingly, NewExist contains zero values, which probably means the variable is undefined for these observations. Their absolute number is negligible, though.
data %>%
count(NewExist)
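The handful of missing NewExist values can be filled in with step_impute_mode, the same step the recipe uses later. A toy sketch of what mode imputation does:

```r
library(tibble)
library(recipes)

# Toy data: one missing value in a two-level factor
toy <- tibble(NewExist = factor(c(1, 1, 2, 1, NA)),
              default_amount = rnorm(5))

imputed <- recipe(default_amount ~ NewExist, data = toy) %>%
  step_impute_mode(NewExist) %>%
  prep() %>%
  bake(new_data = NULL)

sum(is.na(imputed$NewExist))  # 0 -- the NA is replaced with the mode, "1"
```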
Similarly to the employee count, the businesses have created few jobs on an individual basis.
data %>%
ggplot(aes(CreateJob)) +
geom_histogram(fill = "midnightblue", colour = "white") +
labs(title = "Jobs Created Per Business",
subtitle = NULL,
y = "Frequency",
x = "Jobs Created") +
scale_x_log10(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
The same holds for retained jobs.
The franchise code also contains odd values that are not explained in the data dictionary. However, these can be dealt with in the recipe, as they will be lumped together.
data %>%
count(FranchiseCode)
Most businesses are in an urban area; however, many are undefined. The variable is kept to see whether it holds any predictive power.
data %>%
count(UrbanRural)
Gross disbursement indicates the amount actually paid out to borrowers out of the gross approved amount. This can be converted to a percentage in the recipe.
The gross amount denotes the principal granted to the borrower; however, it need not be fully disbursed. Relating the gross amount to the number of employees could also contain information. Unfortunately, there is no financial information on the businesses themselves, which would be most useful for predicting default probability.
The SBA does not always guarantee the same fraction of the loan. This can also be converted to a ratio, potentially containing information.
data %>%
mutate(guaranteed_pct = SBA_Appv/GrAppv) %>%
group_by(ApprovalFY) %>%
summarise(mean_guaranteed = mean(guaranteed_pct),
lower = quantile(guaranteed_pct, 0.1),
higher = quantile(guaranteed_pct, 0.9)) %>%
ggplot(aes(ApprovalFY, mean_guaranteed, group = 1)) +
geom_line(colour = "midnightblue") +
geom_ribbon(aes(ymin = lower, ymax = higher), fill = "midnightblue",
colour = NA, alpha = 0.15, lty = "dashed") +
labs(title = "Fraction Of Loans Guaranteed By The SBA (Government-Backed)",
subtitle = "Confidence intervals show 10th and 90th percentile. Line shows mean.",
y = "Fraction Of Loans Secured By SBA",
x = NULL) +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
Interestingly, the fraction of the loans guaranteed by the SBA went down until the financial crisis, indicating either increased risk appetite by the banks or a parameter change from the SBA. In any case, default rates and the guaranteed portion of the loans seem inversely proportional, so this metric should be interesting.
Finally: The target. Most loans fall into the category between USD 10,000 and USD 1,000,000.
data %>%
ggplot(aes(default_amount)) +
geom_histogram(fill = "midnightblue", colour = "white") +
labs(title = "Default Amount (Target Variable)",
subtitle = NULL,
y = "Frequency",
x = "Default Amount (Base 10 Log Scale)") +
scale_x_log10(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::comma_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
Let’s take a look at the correlation matrix from the GGally package to gauge relations of numeric predictors with the target variable.
data %>%
select_if(is.numeric) %>%
drop_na() %>%
ggcorr(label = T, label_size = 3)
It does not look like the numeric variables are good predictors of the default amount in this setting. However, since correlation is only a linear measure, the random forest may well pick up non-linear relationships that are hidden here.
Proceeding to look at variance inflation factors:
data %>%
select_if(is.numeric) %>%
lm(formula = default_amount ~ .) %>%
vif()
## LoanNr_ChkDgt Zip NAICS ApprovalFY
## 1.135622 1.023087 1.034957 2.047699
## NoEmp NewExist CreateJob RetainedJob
## 1.023053 1.047597 1.068939 1.056503
## FranchiseCode UrbanRural DisbursementGross GrAppv
## 1.046363 1.851725 18.234405 44.007853
## SBA_Appv
## 22.634641
The high VIFs for the disbursement and approved amounts are expected, as these variables are fractions of one another. Multicollinearity does not pose a threat to a random forest, so it will be ignored at this stage.
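The mechanics behind those inflated values can be reproduced on simulated data, assuming vif() comes from the car package:

```r
library(car)  # vif()

set.seed(1)
x1 <- rnorm(200)
x2 <- 0.9 * x1 + rnorm(200, sd = 0.1)  # nearly a multiple of x1
x3 <- rnorm(200)
y  <- x1 + x3 + rnorm(200)

# x1 and x2 show strongly inflated VIFs; the independent x3 stays near 1
vif(lm(y ~ x1 + x2 + x3))
```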
With the ideas gathered from the exploratory data analysis, I can now proceed with building the model.
Disclaimer: It is very tricky to get great predictions on this data set due to the lack of company-level information in the predictors. Normally, the balance sheet information and potentially macroeconomic and industry-level variables would be crucial to predict defaults with great precision. As we don’t have any of that here, the focus of this post is not to get the best predictions, but to demonstrate how to squeeze the most out of the predictors that we have in the recipe steps.
First, the data is split into training and testing sets. Also, three-fold cross validation is employed for reliable calculation of performance metrics, bearing in mind time efficiency.
dt_split <- initial_split(data, strata = "default_amount")
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
folds <- vfold_cv(dt_train, v = 3)
Creating the recipe
As mentioned before, this data set requires you to feature-engineer your way to better predictions. Six new features are created using the step_mutate function from the recipes package:
The recipe in the tidymodel framework makes it very straightforward to include all feature engineering in one step, preventing data leakage from the test set and uniformly applying the same steps to the holdout in the final fit.
rf_rec <-
recipe(default_amount ~ .,
data = dt_train) %>%
step_mutate(across(where(is.character), as.factor)) %>%
step_mutate(across(c(ApprovalFY, NewExist, UrbanRural, FranchiseCode,
NAICS, Zip),
as.factor)) %>%
step_impute_mode(NewExist) %>%
step_mutate(sba_guaranteed_percentage = SBA_Appv / GrAppv) %>%
step_mutate(approved_per_employee = GrAppv / (NoEmp + 1)) %>%
step_mutate(
business_size = case_when(NoEmp < 10 ~ "small",
between(NoEmp, 10, 100) ~ "medium",
NoEmp > 100 ~ "big") %>%
as.factor()
) %>%
step_mutate(
bank_same_state = case_when(
as.character(BankState) == as.character(State) ~ 1,
TRUE ~ 0
) %>%
as.factor()
) %>%
step_mutate(pct_disbursed = DisbursementGross/GrAppv) %>%
step_mutate(is_franchise = case_when(FranchiseCode %in% c(1,0) ~ 0,
TRUE ~ 1) %>% as.factor()) %>%
step_rm(FranchiseCode) %>%
step_rm(Zip) %>%
step_rm(LoanNr_ChkDgt) %>%
step_other(City, NAICS, threshold = 0.1) %>%
step_novel(all_nominal_predictors()) %>%
step_tokenize(Name) %>%
step_stopwords(Name) %>%
step_tokenfilter(Name, max_tokens = 30) %>%
step_tf(Name) %>%
step_normalize(all_numeric_predictors()) %>%
step_zv(all_numeric())
Next on the list: setting up the model specification with tuning options for the hyperparameters. Given the large number of observations in the data, the number of trees should ideally be high. However, that comes at a cost: training time. Since it is already hard to get great predictions on this data set due to the lack of company-level information, the number of trees is kept low at 100.
rf_spec <-
rand_forest(trees = 100,
mtry = tune(),
min_n = tune()) %>%
set_engine(engine = "ranger", importance = "impurity") %>%
set_mode("regression")
In the model specification, you can specify the variable importance, which is calculated based on impurity in this case. Proceeding with setting up the workflow:
rf_wflow <-
workflow() %>%
add_recipe(rf_rec,
blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>%
add_model(rf_spec)
Setting up tuning grid:
rf_grid <-
grid_max_entropy(finalize(mtry(), dt_train),
min_n(),
size = 5)
rf_grid
Now, the hyperparameters can be trained with parallel computing in order to utilise more available computing power.
unregister_dopar <- function() {
env <- foreach:::.foreachGlobals
rm(list=ls(name=env), pos=env)
}
start_time <- Sys.time()
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
rf_tune <- tune_grid(object = rf_wflow,
resamples = folds,
grid = rf_grid,
metrics = metric_set(mae, rsq, rmse))
stopCluster(cl)
unregister_dopar()
end_time <- Sys.time()
end_time - start_time
## Time difference of 1.588639 mins
We can now look at the training results of the previous step and pick the best configuration of hyperparameters.
# Look at tuning results
rf_tune %>%
show_best(metric = "mae")
# Visualise tuning results
rf_tune %>%
collect_metrics() %>%
filter(.metric == "rsq") %>%
select(mtry, min_n, .metric, mean) %>%
pivot_longer(mtry:min_n,
values_to = "value",
names_to = "parameter") %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
labs(x = NULL, y = "rsq") +
theme_bw()
The best model from training is selected and used for the final fit. As the SLICED competition used MAE as the ranking metric, the model is selected accordingly.
rf_final_wflow <- rf_wflow %>%
finalize_workflow(select_best(rf_tune, metric = "mae"))
rf_final_fit <- rf_final_wflow %>%
last_fit(dt_split)
rf_final_fit %>% collect_metrics()
rf_final_fit %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vi() %>%
slice_max(order_by = Importance, n = 20) %>%
ggplot(aes(Importance, reorder(Variable, Importance))) +
geom_col(fill = "midnightblue", colour = "white") +
labs(title = "Variable Importance",
subtitle = "Only the most important predictors are shown.",
y = "Predictor",
x = "Relative Variable Importance") +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))
From the variable importance plot, it becomes apparent which variables matter most for default prediction. Firstly, the fiscal year bears relevance, as the financial crisis marked a rise in default rates. The fraction of the loan guaranteed by the SBA also makes sense: it acts as insurance for the banks, enabling them to shift risk onto the SBA and potentially incentivising higher-risk loans. In this case, that is intended, as the SBA exists to support riskier, that is smaller, businesses. The size of the loan naturally correlates with the default amount in absolute terms, and large loans to very small businesses (measured by employee count) are an important differentiator. The bank and its state matter as well, potentially reflecting differences in risk appetite across geographic locations. Finally, whether the bank sits in the same state as the borrower and the urban/rural distinction are relevant, which is no longer surprising after the EDA, even if it was initially.
Comparing the model's Mean Absolute Error (MAE) to a prediction of all zeros reveals that the model performs worse than simply assuming that no loan defaults. The same does not hold for RMSE, on which the model beats the all-zero baseline. Since RMSE squares the errors before averaging, I can deduce that the model is better at avoiding large errors.
# MAE of the final fit
rf_final_fit %>%
collect_predictions() %>%
mae(default_amount, .pred)
# MAE of all zero
rf_final_fit %>%
collect_predictions() %>%
mutate(.pred_zero = 0) %>%
mae(default_amount, .pred_zero)
# RMSE of the model
rf_final_fit %>%
collect_predictions() %>%
rmse(default_amount, .pred)
# RMSE of all zero
rf_final_fit %>%
collect_predictions() %>%
mutate(.pred_zero = 0) %>%
rmse(default_amount, .pred_zero)
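This divergence between the two metrics can be reproduced on hypothetical numbers, where most loans do not default but one defaults heavily:

```r
truth <- c(rep(0, 9), 100)   # nine healthy loans, one large default (assumed)
pred_model <- rep(10, 10)    # a model hedging with small positive predictions
pred_zero  <- rep(0, 10)     # the all-zero baseline

mae  <- function(t, p) mean(abs(t - p))
rmse <- function(t, p) sqrt(mean((t - p)^2))

mae(truth, pred_model)   # 18 -- worse than the baseline's 10
rmse(truth, pred_model)  # 30 -- better than the baseline's ~31.6
```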
This was exactly the problem David Robinson faced in the SLICED finale. His approach was to add a step function on top of the final predictions, like so:
rf_final_wflow %>%
fit(dt_train) %>%
augment(dt_test) %>%
mutate(pct_default = .pred / GrAppv) %>%
crossing(threshold = seq(0, 1, 0.05),
weight = seq(0, 1, 0.05)) %>%
mutate(.pred = ifelse(pct_default >= threshold, weight * GrAppv, 0)) %>%
group_by(threshold, weight) %>%
mae(default_amount, .pred) %>%
arrange(.estimate)
By thresholding on the predicted default amount divided by the gross loan amount, and shrinking predictions above the threshold with a weight below 1, he was able to beat the all-zero baseline by a tiny margin in combination with an XGBoost model. Unfortunately, the same does not hold for this Random Forest model.
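On hypothetical numbers, the step function works like this (the threshold and weight values below are assumed, not the grid-search optimum):

```r
pred    <- c(5000, 40000, 90000)      # raw model predictions (USD, assumed)
gr_appv <- c(100000, 100000, 100000)  # gross approved amounts (assumed)

threshold <- 0.25  # cut-off on the predicted default percentage
weight    <- 0.80  # shrinkage applied above the cut-off

pct_default <- pred / gr_appv
stepped <- ifelse(pct_default >= threshold, weight * gr_appv, 0)
stepped  # 0 80000 80000 -- small predictions are zeroed out
```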
With the trained model, I can now make predictions for the holdout dataset. Pulling out the workflow:
rf_trained_wflow <- extract_workflow(rf_final_fit)
And making predictions:
rf_trained_wflow %>%
augment(holdout) %>%
transmute(LoanNr_ChkDgt,
default_amount = ifelse(.pred > 0.25, .pred * 0.8, .pred))
This model ranks at 22/30 on the SLICED championship leader board. It seems that other models like XGBoost with a step function like David Robinson’s submission perform much better.
In conclusion, it is hard to get great predictions from this data set, specifically because of the lack of suitable company-level predictors for loan defaults. However, it serves as a useful demonstration of the importance of feature engineering and of post-processing predictions after the final fit, for instance with a step function.
I hope this post has been interesting to you. In case of feedback, feel free to reach out.
Thank you for reading!
A work by Mathias Steilen